
An approach to detecting and understanding Machine Learning Biases using classes

Table of Contents

1. Background on Marketing Bias and the Modcloth Dataset

2. FairDetect Framework

3. Aequitas Framework

1. Background on Marketing Bias and the Modcloth Dataset

In the electronic marketplace and online retail, recommender systems are widely used as decision aids. It is also well known that online recommendations have a big influence on many consumers' decisions. Recent studies indicate that online suggestions can manipulate consumers' preference ratings as well as their readiness to buy certain merchandise.

Recommendation algorithms, which gather and generalize user preference patterns from recorded consumer-product interactions such as purchases and ratings, often fall under the category of collaborative filtering.

These feedback loops could present consumers with unfair (or irrelevant) recommendations, or leave items underrepresented in the input data, because of various biases that may be at play.

A common hypothesis (known as ‘self-congruence’) is that a consumer may tend to buy a product because its public impression (in our case a product image), among other alternatives, is consistent with one’s self-perception (user identity). Based on this assumption, the selection of human models for a product could influence a consumer’s behavior. Studies indicate that, generally, there are more interactions than expected on the consumer-product segments where users’ identities match the product images (‘self-congruity’), while several market segments are underrepresented in the data, for example (‘Large’ user, ‘Small’ product). (https://dl.acm.org/doi/pdf/10.1145/3336191.3371855)

Under this premise, we will use the ModCloth dataset to test for bias with both the FairDetect and Aequitas frameworks. We will look for bias in the creation of a machine learning model that predicts whether a marketing strategy could affect consumer behaviour, resulting in a biased interaction dataset, which is commonly used as the input for modern recommender systems.

ModCloth is an e-commerce website that sells women’s clothing and accessories. Many products in ModCloth include two human models with different body shapes, together with these models’ measurements. Users can optionally provide the product sizes they purchased and fit feedback (‘Just Right’, ‘Slightly Larger’, ‘Larger’, ‘Slightly Smaller’ or ‘Smaller’) along with their reviews.

Therefore our source of bias is the dimension of human body shape. There are two variables of interest:

User identity: the perception of oneself. We compute the average size each user purchased and classify users into ‘Small’ and ‘Large’ groups based on the same standard as the product body-shape image.

Product image: the public impression of a product. Attributes of the human models shown in the product pictures are used to generate this dataset. Products with only one human model wearing a relatively small size (‘XS’, ‘S’, ‘M’ or ‘L’) are labeled as the ‘Small’ group, while products with two models (an additional model wearing a plus size: ‘1X’, ‘2X’, ‘3X’ or ‘4X’) are referred to as the ‘Small&Large’ group.
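The two grouping rules above can be sketched in plain Python. The size ordering and the ‘L’ threshold below are assumptions for illustration; the actual dataset encodes these labels during its construction.

```python
# Map catalogue sizes to a numeric scale (assumed ordering, for illustration).
SIZE_ORDER = {'XS': 0, 'S': 1, 'M': 2, 'L': 3, '1X': 4, '2X': 5, '3X': 6, '4X': 7}

def user_identity(purchased_sizes):
    """Classify a user as 'Small' or 'Large' from the average size purchased."""
    avg = sum(SIZE_ORDER[s] for s in purchased_sizes) / len(purchased_sizes)
    # Sizes up to 'L' count toward the 'Small' group; plus sizes toward 'Large'.
    return 'Small' if avg <= SIZE_ORDER['L'] else 'Large'

def product_image(model_sizes):
    """Label a product 'Small' or 'Small&Large' from its human models' sizes."""
    has_plus_model = any(SIZE_ORDER[s] > SIZE_ORDER['L'] for s in model_sizes)
    return 'Small&Large' if has_plus_model else 'Small'
```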

Using the FairDetect and Aequitas frameworks, we want to understand whether an association exists between product image and user identity in consumers’ product selections.

2. FairDetect Framework

Congregating the various theoretical concepts into a practical framework, we can follow the “theoretical lens of a ‘sense-plan-act’ cycle”, as described by the HLEG framework (European Commission and Directorate-General for Communications Networks, Content and Technology, 2019). Applying this concept to the problem of ML fairness, we can break down three core steps in providing robust, and responsible artificial intelligence: Identify, Understand, and Act (IUA).

  1. Identify: The process of exposing direct or indirect biases within a dataset and/or model.
  2. Understand: The process of isolating impactful scenarios and obtaining transparent explanations for outcomes.
  3. Act: The process of reporting and rectifying identified disparities within the dataset and/or model.

By understanding the philosophical forms of unfairness as defined by our review of the literature, and by categorizing our prominent fairness metrics into the overarching categories of representation, ability, and performance, we can establish a series of tests to “identify” levels of disparity between sensitive groups.

Merging these findings with the explainability of our models, through the use of white-box models or Shapley value estimation for black-box models, we can dig deeper into the model’s predictions, “understanding” how classifications were made and how they varied from the natural dataset, exposing both natural biases and added model differences.

Finally, by probing further into levels of misclassification, in particular looking at negative outcomes, we can isolate the groups most at risk and set up a series of “actions” that can be taken to mitigate the effects. Given this three-step framework, which combines societal, legal, and technical considerations, the paper will then go through a series of cases and examine the proposed framework.

2.1 Importing Relevant Libraries

2.2 Loading Dataset

Transforming / Binarizing
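As a minimal sketch of the binarization step (the column names `user_attr` and `model_attr` and the 0/1 mappings mirror the encoding described later in the analysis, but the exact notebook code may differ):

```python
# Hypothetical mapping of the two string-valued attributes to 0/1 codes,
# matching the encoding used later (Large=0/Small=1; Small=0/Small&Large=1).
USER_MAP = {'Large': 0, 'Small': 1}
MODEL_MAP = {'Small': 0, 'Small&Large': 1}

def binarize(rows):
    """Replace string labels with 0/1 codes in a list of row dicts."""
    out = []
    for r in rows:
        r = dict(r)  # copy so the input rows are not mutated
        r['user_attr'] = USER_MAP[r['user_attr']]
        r['model_attr'] = MODEL_MAP[r['model_attr']]
        out.append(r)
    return out
```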

2.3 Data Exploration

2.4 Training the Machine Learning Model

Target Variable Definition

Splitting the data into train and test
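A stdlib-only sketch of the split step (the notebook itself would typically use sklearn’s `train_test_split`; the 70/30 ratio and the seed are assumptions):

```python
import random

def train_test_split(rows, test_size=0.3, seed=42):
    """Shuffle and split rows into train/test partitions."""
    rows = list(rows)                      # copy before shuffling
    random.Random(seed).shuffle(rows)      # deterministic shuffle via seed
    cut = int(len(rows) * (1 - test_size))
    return rows[:cut], rows[cut:]
```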

Model Training

Creating the Object

2.5 Bias Detection

Representation of Sensitive Variables

The analysis is split into three parts in order to identify areas of bias: representation, ability, and prediction.

REPRESENTATION: Comparison of the sensitive attribute User Identity Group (Large=0 and Small=1) and the target variable (Model Attribute: Product Image: Small=0, Small&Large=1).

Demographic Parity: association of the target variable vs. the sensitive variable.

P-Value: We reject the null hypothesis of no significant relation between the sensitive variable and the target variable. This means THERE IS a relationship between the user identity and the product image.
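The demographic-parity check can be sketched as a chi-square test of independence on the 2×2 contingency table of sensitive attribute vs. target (stdlib-only; a real notebook would likely use `scipy.stats.chi2_contingency`, which also returns the p-value directly). The cutoff 3.841 is the χ² critical value for one degree of freedom at α = 0.05.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

def reject_independence(table, critical=3.841):
    """True if we reject the null hypothesis of independence at alpha = 0.05."""
    return chi_square_2x2(table) > critical
```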

ABILITY: Analysing specific sensitive groups. Regardless of the sensitive background, there should be a 50/50 ratio in a fair scenario.

In the false negative rate (FNR) there is a significant difference between large and small, large enough to reject the null hypothesis. This means there is false negative disparity: there is a greater chance of marketing misrepresentation for users identified with large sizes than for users identified with small sizes.
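A minimal sketch of the per-group FNR comparison (labels and group names here are illustrative; the significance of the gap would still need a statistical test):

```python
def false_negative_rate(y_true, y_pred):
    """FNR = FN / (FN + TP): the share of actual positives the model missed."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return fn / (fn + tp) if fn + tp else 0.0

def fnr_by_group(y_true, y_pred, groups):
    """FNR computed separately for each sensitive-group label."""
    out = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        out[g] = false_negative_rate([y_true[i] for i in idx],
                                     [y_pred[i] for i in idx])
    return out
```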

PREDICTION: The model is not further exacerbating any observed bias. Whatever is present in the dataset is also observed in the predictions.

2.6 Model Importance Comparison

Affected Group = 0 = Large; Affected Target = 0 = Small

Brand is the most significant variable for both user identity types, large and small; however, it is relatively more important for small-size users than for large-size users.

Fit is twice as relevant for users identified with large sizes compared to users identified with small sizes.

For the affected group and target, the most relevant variable is category, followed by size and fit, which are significantly more relevant for this group than for the entire population.

2.7 Disparate impact

Disparate impact is a metric to evaluate fairness. It differentiates between an unprivileged/unfavoured group and a privileged/favoured group. The calculation is the proportion of the unprivileged group that receives the positive outcome, divided by the proportion of the privileged group that receives the positive outcome.

The result is interpreted by the four-fifths rule: if the unprivileged group's positive-outcome rate is less than 80% of the privileged group's, there is a disparate impact violation.

In this analysis we also defined a ratio of 80-90% as a sign of mild impact violation. A ratio of 1 indicates perfect equality.
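The calculation and the thresholds described above can be sketched as:

```python
def disparate_impact(rate_unprivileged, rate_privileged):
    """Ratio of positive-outcome rates: unprivileged / privileged."""
    return rate_unprivileged / rate_privileged

def impact_verdict(ratio):
    """Interpret the ratio using the four-fifths rule, plus the 0.8-0.9
    'mild' band defined in this analysis."""
    if ratio < 0.8:
        return 'disparate impact violation'
    if ratio < 0.9:
        return 'mild impact violation'
    return 'perfect equality' if ratio == 1 else 'acceptable'
```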

2.8 Saving to SQLite3

In case we want to work further with the KPIs of the FairDetect method at a later stage and do not want to call all of the functions above, we can store the intermediate results in SQLite.

We first create the database credit card approval for all tables related to the dataset. Then we connect to the database and create the table pvalues, which stores the p-values of TPR, TNR, FPR and FNR.
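A stdlib-only sketch of this storage step (the notebook itself works through SQLAlchemy; the in-memory path and the sample p-values below are placeholders):

```python
import sqlite3

# In the real notebook this would be a file-backed database, not ':memory:'.
conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE IF NOT EXISTS pvalues (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    tpr REAL, tnr REAL, fpr REAL, fnr REAL)""")

# Insert one row of (hypothetical) p-values from the FairDetect analysis.
conn.execute("INSERT INTO pvalues (tpr, tnr, fpr, fnr) VALUES (?, ?, ?, ?)",
             (0.04, 0.72, 0.61, 0.01))
conn.commit()

# Check the latest entries, as done at the end of this section.
rows = conn.execute("SELECT tpr, tnr, fpr, fnr FROM pvalues "
                    "ORDER BY id DESC LIMIT 5").fetchall()
```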

Through the inspector method it is possible to check which tables are in the respective database and which columns those tables have.

Furthermore we can also look at the metadata of the table and see what type of columns it has.

After the table has been successfully created, we follow the same methodology as in FairDetect: we first create the data-entry object and then enter the dynamic data itself.

We then call the object with the respective data-entry method and hand over the most recent p-values of TPR, FPR, TNR and FNR that resulted from the FairDetect analysis.

To check that the data was inserted, we query the most recent entries in the credit card approval database and print the latest five results.

3. AEQUITAS FRAMEWORK

We are now ready to utilize AEQUITAS to detect bias.

The Aequitas toolkit is a flexible bias-audit utility for algorithmic decision-making models, accessible via a Python API. It will help us evaluate the performance of the model across several bias and fairness metrics. Here are the steps involved:

1) Understand where biases exist in the Modcloth dataset and in the model

2) Compare the level of bias between groups in our sample population (bias disparity)

3) Assess model Fairness and Visualize absolute bias metrics and their related disparities for rapid comprehension and decision-making

[Figure: the Aequitas fairness tree]

3.1 Importing Relevant Libraries

As with any Python program, the first step is to import the necessary packages. Below we import several components from the Aequitas package, along with some other useful non-Aequitas packages.

3.2 Loading and Reformatting the Dataset to Fit the Aequitas Framework

Now that we've identified the protected attribute 'user_attr' and defined privileged and unprivileged values, we can use Aequitas to detect bias in the dataset.

3.3. Model Biases

What is the distribution of groups, predicted scores, and labels across my dataset?

Aequitas’s Group() class enables researchers to evaluate biases across all subgroups in their dataset by assembling a confusion matrix for each subgroup, calculating commonly used metrics such as false positive rate and false omission rate, as well as counts by group and group prevalence among the sample population.

False Positive Rate (FPR) is the fraction of individuals identified with a large size whom the model misclassifies with a small model image. FPR is quite low across all groups and labels.

False Negative Rate (FNR) is the fraction of individuals identified with a small body size whom the model misclassifies with a small & large model image. One group raises our concerns here: the very small size fit group seems to be given a wrong model image very often compared to the rest of the fit attributes.

False Discovery Rate (FDR) is the fraction of individuals the model predicts to have a small size but whose perception of their own body is large. The very large size fit group seems to be impacted in this category.
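The per-subgroup rates described above can also be reproduced by hand. This stdlib sketch mimics, in simplified form, the kind of crosstab Aequitas's Group() class builds, assuming binary labels with positive class 1:

```python
def group_crosstab(y_true, y_pred, groups):
    """Per-subgroup confusion-matrix rates (FPR, FNR, FDR)."""
    out = {}
    for g in set(groups):
        tp = fp = tn = fn = 0
        for t, p, gi in zip(y_true, y_pred, groups):
            if gi != g:
                continue
            if p == 1:                       # predicted positive
                tp, fp = (tp + 1, fp) if t == 1 else (tp, fp + 1)
            else:                            # predicted negative
                tn, fn = (tn + 1, fn) if t == 0 else (tn, fn + 1)
        out[g] = {
            'fpr': fp / (fp + tn) if fp + tn else 0.0,  # false positive rate
            'fnr': fn / (fn + tp) if fn + tp else 0.0,  # false negative rate
            'fdr': fp / (fp + tp) if fp + tp else 0.0,  # false discovery rate
        }
    return out
```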

Visualizing a single absolute group metric across all population groups

The chart below displays the group metric predicted positive rate (ppr) calculated across each attribute, colored based on the number of samples in the attribute group.

We can see from the longer bars that, across the ‘rating’, ‘user_attribute’, and ‘fit’ attributes, the groups Modcloth most often incorrectly predicts as having a small & large image profile are those rated excellent, with a just-right fit judgement, and identified as small size. From the darker coloring, we can also tell that these are the three largest populations in the dataset.

View group metrics for only groups over a certain size threshold

Extremely small group sizes increase the standard error of estimates and can drive prediction errors such as false negatives; hence we use the min_group parameter to visualize only those sample population groups above a user-specified percentage of the total sample size.

3.4 Model Disparities

We use the Aequitas Bias() class to calculate disparities between groups based on the crosstab returned by the Group() class get_crosstabs() method described above.

Disparities are calculated as a ratio of a metric for a group of interest compared to a base group.
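A minimal sketch of that ratio (the group names and the metric key below are illustrative, not the Bias() class's actual output format):

```python
def disparity(metrics_by_group, metric, ref_group):
    """Ratio of a metric for each group to the reference group's value."""
    ref = metrics_by_group[ref_group][metric]
    return {g: m[metric] / ref for g, m in metrics_by_group.items()}
```

By construction the reference group's disparity is 1.0, which is why the reference groups show up as "fair" in the charts that follow.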

The treemap below displays precision disparity values calculated against a predefined reference group, in this case the ‘Small’ group within the user_attr attribute, sized by group size and colored by disparity magnitude. The farther from 1, the more disparity exists among the groups.

Visualizing parity of a single absolute group metric across all population groups

3.5 Overall Fairness

The chart below displays absolute group metric Predicted Positive Rate Disparity (ppr) across each attribute, colored based on fairness determination for that attribute group (green = ‘True’ and red = ‘False’).

We can see from the green color that only the excellent rating, small user attribute, and just-right fit groups have been determined to be fair. These are the groups selected as reference groups, so this model is not fair in terms of Statistical Parity for any of the other groups.

The Parity Test

The Parity Test graph is another visualization aid that helps us identify where the bias is, based on a defined disparity tolerance that can be adjusted according to the results a researcher wishes to evaluate. For the sake of this exercise we have selected a disparity tolerance of 1.25.

From the results we can observe, for each one of the categories in the Modcloth dataset, the level of disparity compared to the reference group for each one of the variables.